Abstract

I tackle the problem of partitioning a sequence into homogeneous seg- 
ments, where homogeneity is defined by a set of Markov models. The prob- 
lem is to study the likelihood that a sequence is divided into a given number 
of segments. Here, the moments of this likelihood are computed through 
an efficient algorithm. Unlike methods involving Hidden Markov Models, 
this algorithm does not require transition probabilities between the models.
Among many possible usages of the likelihood, I present a maximum a pos- 
teriori probability criterion to predict the number of homogeneous segments 
into which a sequence can be divided, and an application of this method to 
find CpG islands. 
1 Introduction

An important element in analysing a sequence of letters is to find out whether 
the sequence has a structure, and if so, how it is structured. Usually, looking 
for structure in a sequence implies a partition - or segmentation - in which each 
segment can be considered "homogeneous", on the basis of a specific criterion.
There are two main approaches to tackle this problem (Braun and Müller, 1998).

A commonly used methodology is to model the sequence with Markov mod- 
els. A Markov model gives, for each word of a given length, the probabilities 
of letters conditionally following this word - called emission probabilities. The 
likelihood of a segment of letters is the product of these probabilities at all the 
positions of the segment. Various models give different likelihoods for a given 
segment, some of them greater than others. Looking for a segmentation of a se- 
quence means dividing it into segments, so that a model chosen as the best from 
amongst a set of models is attributed to each segment. One way to study the 
structure of a sequence is to analyse the set of its segmentations. 
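As a minimal sketch of this product of emission probabilities, the following assumes a first-order model stored as a dictionary `emission[(previous, current)]` together with a `start` distribution for the first letter. These names and the toy numbers are illustrative assumptions, not taken from the article.

```python
import math


def segment_log_likelihood(segment, emission, start):
    """Log-likelihood of a segment under a first-order Markov model.

    emission[(prev, cur)] is the probability of letter cur after letter prev;
    start[letter] is the (assumed) probability of the first letter.
    """
    log_p = math.log(start[segment[0]])
    for prev, cur in zip(segment, segment[1:]):
        log_p += math.log(emission[prev, cur])
    return log_p


# Two hypothetical models over the alphabet {A, B}.
start = {"A": 0.5, "B": 0.5}
sticky = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 0.5, ("B", "B"): 0.5}
uniform = {pair: 0.5 for pair in sticky}

segment = "AAAAB"
scores = {
    "sticky": segment_log_likelihood(segment, sticky, start),
    "uniform": segment_log_likelihood(segment, uniform, start),
}
best_model = max(scores, key=scores.get)  # model with the greater likelihood
```

Working with logarithms avoids numerical underflow when the segment is long.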

To make this task possible, the set of models is usually organized to form 
a Markov meta-model in which there are transition probabilities between the 
models. This is known as a Hidden Markov Model (HMM). In this context, 
the models are usually called states, but for the sake of consistency I keep the
same vocabulary as before. In an HMM, a run of models is a Markov pro- 
cess with a probability, and, given a run of models, the sequence has a like- 
lihood. If a segment is defined as a range of positions modelled by a unique 
model, it is possible to compute the probability of a segmentation given the
sequence and the HMM. As this method permits efficient (i.e. with linear
complexity) algorithms for sequence analysis and partitioning (Rabiner, 1989),
it is used in numerous applications, for example in bioinformatics (Churchill,
1989; Baldi et al., 1994; Lukashin and Borodovsky, 1998; Peshkin and Gelfand,
1999; Nicolas et al., 2002; Boys and Henderson, 2004) and in speech
recognition (Ostendorf et al., 1996).



However, since in an HMM the chain of the models is Markovian, the lengths
of the segments defined by the models are expected to follow geometric laws,
which may be a false hypothesis for real data segments. Various solutions have
been proposed to overcome this problem, such as using semi-Markov chains
(Guédon, 2005) or macro-states (Ephraim and Merhav, 2002), but in fact they
make the modelling task more complex, since more parameters are used to obtain
a better modelling of the lengths of the segments. Moreover, in the problem of
sequence segmentation using a set of models, the inter-model transition
probabilities used in an HMM correspond to an a priori on the distribution of
the segments, and are superfluous parameters if we consider that the models
themselves should be sufficient to segment the sequence, as in the approach
described below. Finally, in an HMM, the modelling of the letters by the models
and the modelling of the segment lengths can be seen as two competing
modellings, because in the parts of the sequence where the models do not
discriminate clearly, the length parameters will have a predominant influence.
This is even more problematic when the lengths of the real segments are very
different along the sequence.

A way to avoid these "extra" parameters is to establish a homogeneity criterion
for a segment (such as the variance of its composition, or its maximum likelihood
given specific models), and to determine a set of segments that divide the
sequence and minimize, or maximize, this criterion. This problem, also known
as the changepoint problem, can be solved by an optimal algorithm (Bellman,
1961), but its time-complexity is quadratic with the length of the sequence,
which prohibits the analysis of very long sequences. Alternatively, this problem
can be tackled linearly using hierarchical segmentation (Li et al., 2002; Li,
2001) or with approximations about the limits of the segments (Barry and
Hartigan, 1993; Braun et al., 2000), but these approaches do not ensure that
the best partition is found. Moreover, when the homogeneity criterion is
monotonic with the number of segments (such as the maximum likelihood of
Markovian processes), these methods need an additional criterion to stop the
segmentation process. For each number of segments, the calculation of the
criterion is based on the built partition and is very dependent on the choice
of this partition. Without a stopping criterion, these methods produce
multi-level descriptions of the structure of the sequences that may be quite
interesting, but I am not aware of any practical usage of such sets of
segmentations.



Between those two approaches, I described in (Guéguen, 2001) an algorithm,
known as MPP, or Maximal Predictive Partitioning, that computes the most
likely segmentation of a sequence in k segments given a set of Markov models.
This algorithm is optimal and has a time-complexity linear with the length of the 
sequence. As with the previous segmentation methods, it provides a multi-level 
description of the structure of a sequence, and it needs an additional criterion to 
select the "best" partition, such as the number of segments. 

Bayesian methods are a different approach to sequence segmentation, since they
propose to simulate the a posteriori distribution of the segmentations of a
sequence given a criterion (Liu and Lawrence, 1999; Makeev et al., 2001;
Salmenkivi et al., 2002; Keith, 2006). Even though they do not construct the
best segmentation, they indicate the relative significance of the
segmentations, and the structuring of the sequence. Nonetheless, as the set of
segmentations is very large, the convergence of the simulated distribution
towards the right one can be extremely slow.

I would now like to look at the problem of estimating the structuring of a
sequence given a set of Markov models. In contrast to the situation for an HMM,
I do not want to put any constraint on the transitions between the models. This
article presents an algorithm that computes the moments of the likelihood of a
sequence under the set of all partitions with a given number of segments. The
maximum of this likelihood was already computable with the MPP algorithm
(Guéguen, 2001). Since the time-complexity of this new algorithm is linear with
the length of the sequence, it can also be applied to very long sequences.

The distribution of this likelihood may be useful for many statistical analy- 
ses of sequences, for example in an HMM modelling to test for the relevance of 
inter-model transition probabilities, or in a change point problem to test the signif- 
icance of partitions and stop the partitioning, or in a bayesian approach to perform 
more efficient simulations of the a posteriori distribution of the segmentations of 
a sequence. As an example, I propose a maximum a posteriori estimator of the 
numbers of segments in a sequence. 
2 Method



2.1 Computing the likelihood of the sequence 

The method computes the moments of the likelihood that a sequence is
partitioned in exactly k segments given a set of Markov models. The algorithm
presented here computes the mean of this distribution. Generalizing
this to the computation of all moments is straightforward.

First, we introduce some notations and concepts. 

The studied sequence, $S$, consists of letters, and has a length $l$. For all
$i \in [0, l-1]$, we denote by $s_i$ the $i$-th letter of $S$, and $S_i$ the
segment of $S$ from position $0$ to position $i$, inclusive. $S = S_{l-1}$.

A $k$-partition is a partition in $k$ segments. A predictive $k$-partition is a
$k$-partition in which a model is associated with each segment, and neighbouring
segments have different models. The set of the predictive $k$-partitions of $S$
is denoted $\mathcal{P}_k$. From here on, all partitions will be predictive
partitions.

Let us call the set of models $D$; for all $d \in D$ we denote by
$\pi_d(i) = \mathrm{pr}(s_i \mid S_{i-1}, d)$ the probability of the $i$-th
letter given the model $d$ and the letters that precede it in the sequence. The
likelihood of a segment $\sigma \subset S$ given a model $d \in D$ is the
product of the likelihoods of its letters,
$\mathrm{pr}(\sigma \mid d) = \prod_{s_i \in \sigma} \pi_d(i)$. For $p$ in
$\mathcal{P}_k$, the likelihood of $S$ given $p$, $\mathrm{pr}(S \mid p, D)$, is
the product of the likelihoods of the predictive segments of $S$ defined by the
partition. We have defined a distribution of the likelihoods over
$\mathcal{P}_k$, $(\mathrm{pr}(S \mid p, D))_{p \in \mathcal{P}_k}$, and we are
looking for the expectation of this distribution,
$\mathrm{pr}(S \mid \mathcal{P}_k, D) = \sum_{p \in \mathcal{P}_k} \mathrm{pr}(S \mid p, D) \cdot \mathrm{pr}(p \mid \mathcal{P}_k)$.

We denote $m_k(i)$ the expectation of the likelihoods of $S_i$ under the set of
the $k$-partitions of $S_i$, and $m_k^d(i)$ the expectation of the likelihoods
of $S_i$ under the set of the $k$-partitions of $S_i$ whose model of the last
segment is $d$. These values can be computed with a dynamic programming
algorithm (the demonstration of which is appended):

$$\forall i \geq 0, \quad m_1^d(i) = \mathrm{pr}(S_i \mid d)$$

$$\forall k \geq 1, \forall i < k-1, \quad m_k(i) = 0$$

$$\forall k \geq 1, \forall i \geq k-1, \quad m_k(i) = \frac{1}{\#D} \sum_{d \in D} m_k^d(i)$$

$$\forall k \geq 2, \forall i \geq k-1, \quad m_k^d(i) = \pi_d(i) \cdot \left[ \frac{i-k+1}{i} \cdot m_k^d(i-1) + \frac{k-1}{i(\#D-1)} \cdot \bigl( \#D \cdot m_{k-1}(i-1) - m_{k-1}^d(i-1) \bigr) \right]$$

As $\mathrm{pr}(S_i \mid d)$ is the likelihood of a segment given a specific
model, it is computable. We can see that when $i = k-1$, the first term inside
the brackets equals 0, which means that $m_k^d(i)$ can be recursively computed.

For each $k$, $\mathrm{pr}(S \mid \mathcal{P}_k, D) = m_k(l-1)$ is the mean
likelihood of $S$ under the set of the predictive $k$-partitions.

When, in the previous formula, we replace $\pi_d(i)$ by $\pi_d^\alpha(i)$, the
expectation of the $\alpha$-th power of the likelihood of $S$,
$E_{p \in \mathcal{P}_k}(\mathrm{pr}(S \mid p, D)^\alpha)$, is computed, which
is the $\alpha$-th moment around 0 of this distribution. When $\alpha$ is a
natural number, it is then easy to compute the $\alpha$-th moment around the
mean, such as the variance.
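As a worked instance of this remark, the variance only needs the first two moments around 0. Writing $\mathcal{P}_k$ for the set of predictive k-partitions (the notation of this section), a standard identity gives:

```latex
\mathrm{Var}_{p \in \mathcal{P}_k}\bigl(\mathrm{pr}(S \mid p, D)\bigr)
  = E_{p \in \mathcal{P}_k}\bigl(\mathrm{pr}(S \mid p, D)^{2}\bigr)
  - \Bigl(E_{p \in \mathcal{P}_k}\bigl(\mathrm{pr}(S \mid p, D)\bigr)\Bigr)^{2}
```

The first term is obtained by running the recursion with the squared emission probabilities, the second with the emission probabilities themselves.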

This algorithm has a linear time-complexity with the product of the number 
of models and the length of the sequence. Hence these likelihoods are quite com- 
putable, even for very long sequences. 
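The recursion of this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Sarment implementation: the function names, the `alpha` argument and the toy emission table are hypothetical, plain floating-point products are used (a real implementation on long sequences would need logarithms or rescaling to avoid underflow), and at least two models are assumed. A brute-force enumeration is included only to check the recursion on a tiny example.

```python
from itertools import combinations, product


def mean_likelihoods(pi, k_max, alpha=1.0):
    """Return {k: E[pr(S | p, D)**alpha]} over predictive k-partitions, k = 1..k_max.

    pi[d][i] is the probability of letter i under model d (needs >= 2 models).
    """
    n_models, length = len(pi), len(pi[0])
    w = [[x ** alpha for x in row] for row in pi]
    # k = 1: m_1^d(i) is the running product of the emissions of model d.
    cur = []
    for d in range(n_models):
        acc, row = 1.0, []
        for x in w[d]:
            acc *= x
            row.append(acc)
        cur.append(row)
    means = {1: sum(row[-1] for row in cur) / n_models}
    for k in range(2, k_max + 1):
        prev = cur
        prev_mean = [sum(col) / n_models for col in zip(*prev)]  # m_{k-1}(i)
        cur = [[0.0] * length for _ in range(n_models)]  # zero for i < k - 1
        for i in range(k - 1, length):
            for d in range(n_models):
                stay = (i - k + 1) / i * cur[d][i - 1]
                switch = (k - 1) / (i * (n_models - 1)) * (
                    n_models * prev_mean[i - 1] - prev[d][i - 1])
                cur[d][i] = w[d][i] * (stay + switch)
        means[k] = sum(row[-1] for row in cur) / n_models
    return means


def brute_force_mean(pi, k):
    """Average likelihood over all explicitly enumerated predictive k-partitions."""
    n_models, length = len(pi), len(pi[0])
    values = []
    for cuts in combinations(range(1, length), k - 1):
        bounds = [0, *cuts, length]
        for models in product(range(n_models), repeat=k):
            if any(a == b for a, b in zip(models, models[1:])):
                continue  # neighbouring segments must use different models
            value = 1.0
            for seg, d in enumerate(models):
                for i in range(bounds[seg], bounds[seg + 1]):
                    value *= pi[d][i]
            values.append(value)
    return sum(values) / len(values)


# Toy emissions for a 6-letter sequence under three hypothetical models.
pi = [[0.3, 0.7, 0.3, 0.3, 0.7, 0.7],
      [0.6, 0.4, 0.6, 0.6, 0.4, 0.4],
      [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]
means = mean_likelihoods(pi, k_max=4)
for k in range(1, 5):
    assert abs(means[k] - brute_force_mean(pi, k)) < 1e-12
```

For each k the two nested loops visit every (position, model) pair once, which is the linear cost stated above. Passing `alpha=2.0` gives the second moment around 0.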

2.2 Estimating the a posteriori probabilities 

Considering the segmentation problem, we are actually interested in the a
posteriori probability of the number of segments given the sequence, say $N$.
We hypothesize hereafter that the probability of this number is equal to
$\mathrm{pr}(\mathcal{P}_N \mid S, D)$, even though this hypothesis deserves a
closer examination. However, it is reasonable to assume that
$\mathrm{pr}(N \mid S, D)$ and $\mathrm{pr}(\mathcal{P}_N \mid S, D)$ have the
same modes, and that a maximum a posteriori estimator of
$\mathrm{pr}(\mathcal{P}_N \mid S, D)$ will be a maximum a posteriori estimator
of $\mathrm{pr}(N \mid S, D)$.

Owing to the Bayesian formula
$\mathrm{pr}(\mathcal{P}_N \mid S, D) \propto \mathrm{pr}(S \mid \mathcal{P}_N, D) \, \mathrm{pr}(\mathcal{P}_N \mid D)$,
an a priori on the distribution of $\mathrm{pr}(\mathcal{P}_N \mid D)$ has to be
set. If this a priori is uniform with $k$, the a posteriori probability is
directly proportional to the likelihood computed in the previous section:
$\mathrm{pr}(\mathcal{P}_{N=k} \mid S, D) \propto \mathrm{pr}(S \mid \mathcal{P}_k, D)$.

Another a priori is analogous to the HMM modelling: we consider that the
segment length follows a geometric distribution with a given mean, say
$\lambda$. Then a priori $N - 1$ follows a binomial distribution of parameter
$1/\lambda$, and if we define a random variable
$X \sim \mathrm{Bin}(l, 1/\lambda)$,
$\mathrm{pr}(\mathcal{P}_{N=k} \mid S, D) \propto \mathrm{pr}(S \mid \mathcal{P}_k, D) \cdot \mathrm{pr}(X = k-1)$.

A more experimental approach is to consider that
$\mathrm{pr}(\mathcal{P}_N \mid S, D)$ follows a given law with some parameters,
and to simulate sequences generated by $k$-partitions to fit these parameters
as well as possible, according to an optimization criterion. An obvious
criterion is to minimize the mean square error of the maximum a posteriori
estimation of the numbers of segments.
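The uniform and binomial a priori distributions can be combined with the mean likelihoods as in the sketch below. It assumes that the logarithms of the mean likelihoods are already available in a dictionary indexed by the number of segments; the function names and the toy values are hypothetical.

```python
import math


def log_binom_pmf(x, n, p):
    """Logarithm of pr(X = x) for X ~ Bin(n, p), with 0 < p < 1."""
    if x < 0 or x > n:
        return -math.inf
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log1p(-p))


def map_uniform(log_lik):
    """Number of segments with the highest likelihood (uniform a priori)."""
    return max(log_lik, key=log_lik.get)


def map_binomial(log_lik, n_trials, p):
    """Maximum a posteriori number of segments when N - 1 ~ Bin(n_trials, p)."""
    return max(log_lik,
               key=lambda k: log_lik[k] + log_binom_pmf(k - 1, n_trials, p))


# Toy log-likelihood curve that keeps increasing slowly after k = 10.
log_lik = {k: 5.0 * min(k, 10) + 0.2 * max(0, k - 10) for k in range(1, 51)}
k_uniform = map_uniform(log_lik)                 # pulled towards large k
k_binomial = map_binomial(log_lik, 9999, 0.001)  # penalized by the a priori
```

With a flat likelihood the binomial estimator simply returns the mode of the a priori plus one, which makes the influence of the parameter easy to inspect.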

2.3 Implementation 

This algorithm has been implemented in C++, and is freely available via Python
modules in Sarment (Guéguen, 2005) at the URL:
http://pbil.univ-lyon1.fr/software/sarment/

The examples of the next section are described in the tutorial at the same 
location. 
3 Maximum a posteriori estimation



3.1 The a priori distribution 

To build a good a posteriori estimator, we still need to look for a relevant a
priori probability on the $\mathcal{P}_k$. To test this, I have generated random
sequences over an alphabet of two letters (A and B), from several Markov models
and random k-partitions, for several values of k. We denote Bern(a) the model
where the emission probability of an A is a (and that of a B is 1 − a). The
positions of the limits of the segments were uniformly generated, so that each
segment was at least 50 positions long, and the models were uniformly assigned
to each segment so that no two neighbouring segments shared the same model. For
each k, 100 random k-partitions and sequences 10,000 letters in length have thus
been generated. To understand how the algorithm performs on more or less
strongly segmented sequences, the next examples present sequences generated
from models Bern(0.3) and Bern(0.7), and sequences generated from more similar
models Bern(0.4) and Bern(0.6). The same models have been used to compute the
likelihoods.
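One way to reproduce this kind of simulation is sketched below. The exact sampling procedure used for the article is not specified beyond the constraints above, so the "stars and bars" draw of the segment lengths, the function name and its parameters are assumptions.

```python
import random


def simulate_partitioned_sequence(length, k, probs_a, min_len=50, rng=None):
    """Draw a random k-partition and a sequence of A/B letters emitted from it.

    probs_a[d] is the probability of emitting an A under model Bern(probs_a[d]).
    Returns the sequence, the segment lengths and the model of each segment.
    """
    rng = rng or random.Random()
    extra = length - k * min_len
    if extra < 0:
        raise ValueError("sequence too short for k segments of min_len letters")
    # Stars and bars: spread the 'extra' letters uniformly over the k segments.
    bars = sorted(rng.sample(range(extra + k - 1), k - 1))
    lengths, prev = [], -1
    for bar in bars + [extra + k - 1]:
        lengths.append(min_len + bar - prev - 1)
        prev = bar
    # Draw models so that neighbouring segments never share the same one.
    models = [rng.randrange(len(probs_a))]
    while len(models) < k:
        others = [d for d in range(len(probs_a)) if d != models[-1]]
        models.append(rng.choice(others))
    letters = []
    for seg_len, d in zip(lengths, models):
        letters.extend("A" if rng.random() < probs_a[d] else "B"
                       for _ in range(seg_len))
    return "".join(letters), lengths, models


# One sequence of 10,000 letters with 30 segments from Bern(0.3) and Bern(0.7).
sequence, lengths, models = simulate_partitioned_sequence(
    10_000, 30, [0.3, 0.7], rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the simulated data reproducible between runs.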

First, I searched for the number of segments N for which the sequence has the
highest likelihood. This is equivalent to using the uniform a priori
distribution.

The examples of log-likelihoods in Fig. 1 show a typical behaviour: the
neighbourhood of the maximum likelihood is reached very quickly, and there are
several numbers of segments with a likelihood "near" this maximum. While in the
left example the maximum is reached at the exact number of segments, in the
right example it is reached for a higher number.

Overall, the predicted numbers of segments are in accordance with
the simulated numbers (Fig. 2). However, as the segments become more difficult
to discriminate (when the average size of the simulated segments decreases or
when the models generating the segments are more similar), the predicted number
tends to be over-estimated. This means that the number of segments with the
highest likelihood is not in fact the one most relevant for this prediction, and
an a priori other than the uniform distribution should be chosen.

The a priori can be based on the length of the segments, as is done in
HMM modelling. Since in the simulations the inter-segment positions of the
random partitions were uniformly taken along the sequence, the lengths of the
simulated segments followed a geometric distribution, which should favour the
analysis through HMM.

I have studied these sequences with the likelihood algorithm and with an
HMM. The HMM used had the exact Markov models and an additional parameter
p on the transition probabilities between the states, so that the average
length of the segments is 1/p. To get the resulting partition, I applied the
forward-backward algorithm to the sequences, and successive positions were
clustered in a segment when their most likely state was identical. Since the
sequences were 10,000 letters long, the number of predicted segments minus one
follows the binomial law Bin(9999, p). I used p = 0.001 (10 segments) and
p = 0.005 (50 segments), and again models Bern(0.3) versus Bern(0.7) and
Bern(0.4) versus Bern(0.6) (Fig. 3).
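The HMM analysis described here can be sketched as a scaled forward-backward pass followed by a count of the runs of identical most likely states. The uniform initial distribution and the even split of the switching probability p among the other states are assumptions of this sketch, as are the function and parameter names.

```python
def posterior_decoding_segments(seq, emit_a, p):
    """Most likely state at each position, and the resulting number of segments.

    seq is a string of 'A'/'B'; emit_a[d] is the probability of an A in state d;
    p is the probability of leaving the current state at each position.
    """
    n = len(emit_a)
    stay, move = 1.0 - p, p / (n - 1)

    def emit(d, letter):
        return emit_a[d] if letter == "A" else 1.0 - emit_a[d]

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at each position to avoid underflow.
    fwd = [normalise([emit(d, seq[0]) / n for d in range(n)])]
    for letter in seq[1:]:
        prev, tot = fwd[-1], sum(fwd[-1])
        fwd.append(normalise(
            [emit(d, letter) * (stay * prev[d] + move * (tot - prev[d]))
             for d in range(n)]))
    # Backward pass, rescaled in the same way.
    bwd = [[1.0] * n for _ in seq]
    for i in range(len(seq) - 2, -1, -1):
        nxt = [emit(d, seq[i + 1]) * bwd[i + 1][d] for d in range(n)]
        tot = sum(nxt)
        bwd[i] = normalise([stay * nxt[d] + move * (tot - nxt[d])
                            for d in range(n)])
    states = [max(range(n), key=lambda d: fwd[i][d] * bwd[i][d])
              for i in range(len(seq))]
    n_segments = 1 + sum(a != b for a, b in zip(states, states[1:]))
    return states, n_segments


states, n_segments = posterior_decoding_segments("A" * 200 + "B" * 200,
                                                 [0.9, 0.1], p=0.01)
```

Because each position is decoded separately, the number of segments is a by-product of the decoding rather than a quantity the HMM estimates directly.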

Figure 3 shows that when the models are distant (Bern(0.3) versus Bern(0.7)),
the forward-backward algorithm performs rather well. However, with p = 0.005
the number of segments is more over-estimated than with p = 0.001, since a
larger p tends to increase the number of segments. When the models are less
different, as with
Bern(0.4) versus Bern(0.6), the influence of p becomes critical. In this example, 
p = 0.001 under-estimates the number of segments when the real number is over 
10, since this parameter means that a priori on average the sequence has 10 seg- 
ments. With p = 0.005 the predictions over-estimate slightly for small numbers 
of segments, and they tend to under-estimate as the real number increases. 

We can see that when this estimator is biased, the bias depends on the value
of the inter-state probability and on the real number of segments in the
sequence, which is not known beforehand. Nonetheless, to study the effect of
this a priori on the maximum a posteriori estimator, I have set the same
binomial a priori distribution on $\mathrm{pr}(\mathcal{P}_N \mid D)$, with
p = 0.001 and p = 0.005. We can see in Fig. 4 the same behaviour as with the
HMM modelling, but with a much larger over-estimation of the number
of segments when p = 0.005. It means that the tendency of this a priori to
"drag" the maximum a posteriori towards 50 segments is here more influential.
When the real number of segments is near 50, the over-estimation is lower than
in Fig. 2 for the same reason. Thus, even though it corresponds to the modelling
of HMM, a binomial a priori is not relevant for maximum a posteriori estimation
of the number of segments.

An experimental way to set up an a priori distribution is to define it through a 
set of parameters, that will be optimized by simulations. The optimization func- 
tion is the minimization of the mean square error between the maximum a poste- 
riori estimation and the real numbers k of segments, summed for all k from 1 to 
50. 

A first way would be to optimize the parameter of the binomial a priori
distribution. Indeed, the poor efficiency in these examples could be due to a
bad parameter value. In these simulations, the optimal value p is 0.00098
(resp. 0.0021) for the models Bern(0.3) versus Bern(0.7) (resp. Bern(0.4) versus
Bern(0.6)). The first optimization is quite efficient (Fig. 5) but when the
segments are less different there is an over-estimation of the number of
segments for small k, and an under-estimation for large k, as in the previous
section. The correct estimations are around 25 segments, a balance between
over-estimating and under-estimating all the k between 1 and 50. Thus, even with
an optimization process, a binomial a priori does not give an efficient a
posteriori estimator.

I tried the same kind of optimization with a geometric a priori distribution
$\mathcal{G}(\theta)$:
$\mathrm{pr}(\mathcal{P}_{N=k} \mid D, S) \propto \mathrm{pr}(S \mid \mathcal{P}_k, D) \cdot \theta^k$.
I performed the same round of sequence simulations as before twice: one set for
the optimization of the parameter, and one set to test the obtained estimator.
On these examples, when the models are distant enough, as in Bern(0.3) versus
Bern(0.7), the estimator is quite accurate, and it is unbiased, even with
Bern(0.4) versus Bern(0.6) models (Fig. 6).
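The optimization of the geometric parameter can be sketched as a grid search minimizing the mean square error between the estimated and the true numbers of segments. The grid, the function names and the synthetic log-likelihood curves below are assumptions made for illustration; in the article the curves come from simulated sequences.

```python
import math


def map_geometric(log_lik, theta):
    """Maximum a posteriori number of segments with an a priori in theta**k."""
    return max(log_lik, key=lambda k: log_lik[k] + k * math.log(theta))


def fit_theta(training, grid):
    """Pick the theta of the grid that minimizes the mean square error.

    training is a list of (true_k, log_lik) pairs, where log_lik maps each
    candidate number of segments to the log of the mean likelihood.
    """
    def mse(theta):
        errors = [(map_geometric(log_lik, theta) - true_k) ** 2
                  for true_k, log_lik in training]
        return sum(errors) / len(errors)
    return min(grid, key=mse)


def toy_curve(true_k, k_max=50):
    """Synthetic curve: steep rise up to true_k, then a slow further increase."""
    return {k: 2.0 * min(k, true_k) + 0.3 * max(0, k - true_k)
            for k in range(1, k_max + 1)}


training = [(k, toy_curve(k)) for k in range(1, 51)]
grid = [i / 20 for i in range(1, 21)]  # theta in {0.05, 0.10, ..., 1.00}
theta = fit_theta(training, grid)
```

A second, independent set of simulated curves should then be used to test the fitted value, as described above.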

This example shows that this approach can give good results, even though it 
is up to now only experimental. A theoretical study may be useful to set up an 
even more efficient a priori, and to avoid the cost of simulations as well as the
numerical optimization process. We can expect this distribution to depend on the 
set of models and on the length of the sequence, and it would be quite interesting 
to study it thoroughly. 
3.2 CpG islands

In vertebrate genomes, CpG dinucleotides are mostly methylated and this
methylation entails a hypermutability of these nucleotides, from CpG to TpG or
CpA. A usual measure of this feature is to compute the ratio of the observed
CpG dinucleotides over the expected number when the nucleotides are
independent:

$$\mathrm{CpG}_{o/e} = \frac{\text{frequency of CpG}}{\text{frequency of C} \times \text{frequency of G}}$$

In some stretches of DNA, known as CpG islands, the CpG dinucleotides are
hypomethylated. These islands are often associated with promoter regions
(Ponger et al., 2001). They show a higher CpGo/e than surrounding sequences, at
least 0.6. Moreover, a CpG island is expected to be at least 300 bases long. I
wanted to segment a sequence of the mouse genome to reveal the occurrences of
CpG islands. The CpGo/e ratio on this sequence is shown in 1,000-base sliding
windows (Fig. 7, middle).
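The CpGo/e ratio and its sliding-window profile can be computed as in the following sketch. The function names, the window step and the convention of returning 0.0 when the ratio is undefined are assumptions; the dinucleotide frequency is taken here over the n − 1 overlapping pairs of a window of n bases.

```python
def cpg_obs_exp(seq):
    """Observed over expected CpG ratio of a DNA string (0.0 if undefined)."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    freq_c = seq.count("C") / n
    freq_g = seq.count("G") / n
    if freq_c == 0.0 or freq_g == 0.0:
        return 0.0
    cpg = sum(1 for a, b in zip(seq, seq[1:]) if a == "C" and b == "G")
    return (cpg / (n - 1)) / (freq_c * freq_g)


def sliding_cpg_obs_exp(seq, window=1000, step=100):
    """List of (start, CpGo/e) pairs over windows of the given size."""
    last_start = max(len(seq) - window, 0)
    return [(start, cpg_obs_exp(seq[start:start + window]))
            for start in range(0, last_start + 1, step)]


profile = sliding_cpg_obs_exp("ACGT" * 500, window=1000, step=500)
```

Windows whose ratio stays above 0.6 over at least 300 bases are then candidates in the sense of the criteria quoted above.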




As described by Durbin et al. (1998), I defined two first-order Markov models,
built by maximum likelihood on known data: the first is trained on CpG islands,
and the other on segments that are between the CpG islands. I used those
models to compute the segmentation likelihood on a sequence of the mouse genome
(Fig. 8), for up to 50 segments.
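Building a first-order Markov model by maximum likelihood amounts to counting the transitions between consecutive letters in the training data and normalizing each row. The sketch below illustrates this; the function name, the optional `pseudocount` argument and the uniform fallback for letters never seen in the training set are assumptions, and the training string is a toy example rather than real CpG island data.

```python
from collections import Counter

ALPHABET = "ACGT"


def train_first_order(sequences, pseudocount=0.0):
    """Estimate model[(prev, cur)] = pr(cur | prev) from training sequences."""
    counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev, cur] += 1
    model = {}
    for prev in ALPHABET:
        total = sum(counts[prev, cur] + pseudocount for cur in ALPHABET)
        for cur in ALPHABET:
            if total > 0:
                model[prev, cur] = (counts[prev, cur] + pseudocount) / total
            else:
                model[prev, cur] = 1.0 / len(ALPHABET)  # letter never observed
    return model


# Toy training set; one model would be trained on islands, one on the rest.
island_model = train_first_order(["CGCGCA"])
```

A positive pseudocount avoids zero probabilities, which would otherwise make the likelihood of any segment containing an unseen transition vanish.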
I looked for the maximum a posteriori estimator of the number of segments,
with a geometric a priori distribution, and I simulated random sequences through
the same process as described in section 3.1. The optimization of the maximum a
posteriori estimator gives θ = 0.546, and the result of this optimization is
shown in Fig. 8. We can see that this estimator remains unbiased up to 50
segments, and is quite precise.

With this a priori, the maximum a posteriori estimator on the mouse sequence
gives 17 segments, and the CpG islands predicted in the most likely 17-partition
are shown in Fig. 7 (bottom).
4 Discussion

In this article, I propose an algorithm to compute the moments of the likelihood of 
segmentation of a sequence in a number of segments, given a set of Markov mod- 
els. This algorithm has a time-complexity linear with the length of the sequence 
and the number of models, and it can be used on very long sequences. 

From this likelihood, it should be possible to compare the numbers of seg- 
ments to partition a sequence, either through statistical tests or through a bayesian 
approach. In a bayesian approach, the a priori distribution of the numbers of 
classes must be defined, and I give some examples where a geometric a priori dis- 
tribution gives a quite precise maximum a posteriori estimator. This has been only 
validated with simulations, and a full theoretical study is yet to be undertaken on
the a priori distribution. Moreover, it would be quite interesting to define some 
statistical tests to assess the relative significance - confidence intervals and p- 
values - of the numbers of segments, given the models and the sequence. The fact 
that the moments of the distribution of the likelihood can be computed could be 
useful for this, as well as for an improvement of the previous estimator. 

This algorithm does not put any constraint on the succession of models, but 
works as if the transition graph between the models were a clique. It is easy 
to see from the Appendix that it can be adapted to any kind of transition graph, 
which means that it may be useful in the context of HMM analysis, for example 
to check - or determine - the inter-model probabilities of the models, given a 
sequence. In this context, it could also be interesting to use the likelihood to 
enhance the efficiency of methods related to HMM modelling, for example for 
post-analysis of forward-backward algorithm. As in HMM modelling, one aim 
would be to compute the probability that a position is predicted by a model, given 
a set of models, and possibly given a number of segments. If the MPP algorithm 
is equivalent to the Viterbi algorithm for HMM, computing this probability would 
be the equivalent of the forward-backward algorithm in this context. 

Even if model inference is outside the scope of this article, it is a very
important feature in sequence analysis, and it will be interesting to use the
likelihood for this. In Polansky (2007), there is an example of inference of
Markov models from a sequence, outside the context of HMM, but it is
practically limited by the number of segments in the sequence and, since it
uses the maximum likelihood, an additional penalization criterion (AIC or BIC)
is necessary to handle this number. It should be possible to use the
calculation of the average likelihood to get rid of these problems. Another
inference process is the maximization, among a set of models, of the average
likelihood. Moreover, it would be relevant to use the Bayesian approach to
estimate and simulate a posteriori probabilities for the parameters of the
models, given the sequence.

Finally, as I said in the introduction, to my knowledge multi-level
segmentations of sequences are not used for sequence analysis, despite their
relevance. An important barrier to this is the lack of evaluation criteria for
these levels. Computing the likelihood for the successive numbers of segments
may then be a quite useful tool to develop this kind of methodology. It would
bring out a much richer modelling of the sequence.
Appendix



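Here is a demonstration of the formula described in section 2, keeping the same
notations. We define:

$\mathcal{P}_k(i)$, the set of the $k$-partitions of $S_i$;

$\mathcal{P}_k^d(i)$, the set of the $k$-partitions of $S_i$ whose model of the
last segment is $d$.

$m_k(i)$ is the likelihood of $S_i$ under $\mathcal{P}_k(i)$:
$\forall k \geq 1, \forall i \geq k-1$,
$m_k(i) = \mathrm{pr}(S_i \mid \mathcal{P}_k(i))$, and $m_k^d(i)$ is the
likelihood of $S_i$ under $\mathcal{P}_k^d(i)$:
$\forall k \geq 1, \forall i \geq k-1$,
$m_k^d(i) = \mathrm{pr}(S_i \mid \mathcal{P}_k^d(i))$.

We follow a Bayesian approach, in which, for each $k$, all the $k$-partitions
are equiprobable in $\mathcal{P}_k$.

If the a priori on the last model $d$ is uniform:

$$m_k(i) = \mathrm{pr}(S_i \mid \mathcal{P}_k(i)) = \sum_{d \in D} m_k^d(i) \cdot \mathrm{pr}(\mathcal{P}_k^d(i) \mid \mathcal{P}_k(i)) = \frac{1}{\#D} \sum_{d \in D} m_k^d(i) \qquad (1)$$

If we note $d_p(i)$ the model used in partition $p$ at position $i$, we have for
all $k \geq 2$ and $i \geq k-1$,

$$m_k^d(i) = \mathrm{pr}(S_i \mid \mathcal{P}_k^d(i)) = \sum_{p \in \mathcal{P}_k^d(i)} \mathrm{pr}(S_i \mid p) \cdot \mathrm{pr}(p \mid \mathcal{P}_k^d(i))$$

$$= \sum_{\substack{p \in \mathcal{P}_k^d(i) \\ d_p(i-1) = d}} \mathrm{pr}(S_i \mid p) \cdot \#\mathcal{P}_k^d(i)^{-1} + \sum_{\substack{p \in \mathcal{P}_k^d(i) \\ d_p(i-1) \neq d}} \mathrm{pr}(S_i \mid p) \cdot \#\mathcal{P}_k^d(i)^{-1}$$

If $p \in \mathcal{P}_k^d(i)$ and $d_p(i-1) = d$, $p$ is like a $k$-partition
$p'$ of $S_{i-1}$ whose last model, $d$, is used to emit $s_i$. So
$\mathrm{pr}(S_i \mid p) = \pi_d(i) \cdot \mathrm{pr}(S_{i-1} \mid p')$ with
$p' \in \mathcal{P}_k^d(i-1)$.

If $p \in \mathcal{P}_k^d(i)$ and $d_p(i-1) \neq d$, $p$ is like a
$(k-1)$-partition $p'$ of $S_{i-1}$ whose last model, $d'$, is different from
$d$. So $\mathrm{pr}(S_i \mid p) = \pi_d(i) \cdot \mathrm{pr}(S_{i-1} \mid p')$
with $p' \in \mathcal{P}_{k-1}^{d'}(i-1)$.

Hence

$$m_k^d(i) = \pi_d(i) \cdot \left( \sum_{p \in \mathcal{P}_k^d(i-1)} \mathrm{pr}(S_{i-1} \mid p) \cdot \#\mathcal{P}_k^d(i)^{-1} + \sum_{d' \neq d} \sum_{p \in \mathcal{P}_{k-1}^{d'}(i-1)} \mathrm{pr}(S_{i-1} \mid p) \cdot \#\mathcal{P}_k^d(i)^{-1} \right) \qquad (2)$$

In a partition of $\mathcal{P}_k^d(i)$, the last model is $d$, the one before is
any of the $\#D - 1$ other ones, and so on for the $k-2$ remaining models. So
there are $(\#D-1)^{k-1}$ possible sets of models for this partition. Moreover,
the limits of the segments are defined by $k-1$ positions among the $i$
possible, so there are $C_i^{k-1}$ possible sets of positions. So

$$\#\mathcal{P}_k^d(i) = C_i^{k-1} (\#D-1)^{k-1}$$

and

$$\#\mathcal{P}_k^d(i) = \frac{i!}{(k-1)!\,(i-k+1)!} (\#D-1)^{k-1} = \frac{i}{i-k+1} \, \#\mathcal{P}_k^d(i-1) = \frac{i}{k-1} (\#D-1) \, \#\mathcal{P}_{k-1}^{d'}(i-1)$$

If we replace $\#\mathcal{P}_k^d(i)$ in (2):

$$m_k^d(i) = \pi_d(i) \cdot \left( \sum_{p \in \mathcal{P}_k^d(i-1)} \mathrm{pr}(S_{i-1} \mid p) \cdot \frac{i-k+1}{i} \, \#\mathcal{P}_k^d(i-1)^{-1} + \sum_{d' \neq d} \sum_{p \in \mathcal{P}_{k-1}^{d'}(i-1)} \mathrm{pr}(S_{i-1} \mid p) \cdot \frac{k-1}{i(\#D-1)} \, \#\mathcal{P}_{k-1}^{d'}(i-1)^{-1} \right)$$

$$= \pi_d(i) \cdot \left[ \frac{i-k+1}{i} \sum_{p \in \mathcal{P}_k^d(i-1)} \mathrm{pr}(S_{i-1} \mid p) \cdot \mathrm{pr}(p \mid p \in \mathcal{P}_k^d(i-1)) + \frac{k-1}{i(\#D-1)} \sum_{d' \neq d} \sum_{p \in \mathcal{P}_{k-1}^{d'}(i-1)} \mathrm{pr}(S_{i-1} \mid p) \cdot \mathrm{pr}(p \mid p \in \mathcal{P}_{k-1}^{d'}(i-1)) \right]$$

$$= \pi_d(i) \cdot \left[ \frac{i-k+1}{i} \, \mathrm{pr}(S_{i-1} \mid \mathcal{P}_k^d(i-1)) + \frac{k-1}{i(\#D-1)} \sum_{d' \neq d} \mathrm{pr}(S_{i-1} \mid \mathcal{P}_{k-1}^{d'}(i-1)) \right]$$

$$m_k^d(i) = \pi_d(i) \cdot \left[ \frac{i-k+1}{i} \cdot m_k^d(i-1) + \frac{k-1}{i(\#D-1)} \sum_{d' \neq d} m_{k-1}^{d'}(i-1) \right]$$

And to make the algorithm faster, from (1),

$$\sum_{d' \neq d} m_{k-1}^{d'}(i-1) = \#D \cdot m_{k-1}(i-1) - m_{k-1}^d(i-1)$$

gives the formula.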
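The counting argument of this demonstration can be checked by enumeration on small cases. The sketch below is only such a check, with a hypothetical function name: it lists every choice of k − 1 limits among the i possible positions and every assignment of models ending with the given last model, and compares the count with the closed form.

```python
from itertools import combinations, product
from math import comb


def count_partitions(i, k, n_models, last):
    """Number of predictive k-partitions of S_i whose last model is 'last'."""
    count = 0
    # S_i has i + 1 letters, hence i possible limits between neighbours.
    for _cuts in combinations(range(1, i + 1), k - 1):
        for models in product(range(n_models), repeat=k):
            differ = all(a != b for a, b in zip(models, models[1:]))
            if models[-1] == last and differ:
                count += 1
    return count


# Closed form: C(i, k - 1) * (#D - 1) ** (k - 1)
assert count_partitions(5, 3, 3, last=0) == comb(5, 2) * (3 - 1) ** 2
assert count_partitions(4, 2, 2, last=1) == comb(4, 1) * (2 - 1) ** 1
```

The ratios used in the demonstration follow from this closed form, for instance C(4, 2) / C(5, 2) = 3/5 = (i − k + 1)/i for i = 5 and k = 3.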
Figure 1: Log-likelihood of two random sequences generated by 30 segments
from two models. The dashed vertical line represents 30 segments.
[Recovered panel title: Bern(0.4) versus Bern(0.6).]
Figure 2: Boxplots of the numbers of segments N reaching the maximum
likelihood of the sequence, for a simulated number of segments k between 1 and
50. The oblique line represents the right number of segments (N = k).
Figure 3: Boxplots of the numbers of segments N predicted by the
forward-backward algorithm, for a simulated number of segments k between 1 and
50. The oblique line represents the right number of segments (N = k).
[Panel titles: Bern(0.3) versus Bern(0.7) with p=0.001; Bern(0.4) versus
Bern(0.6) with p=0.001; Bern(0.3) versus Bern(0.7) with p=0.005; Bern(0.4)
versus Bern(0.6) with p=0.005.]
Figure 4: Boxplots of the numbers of segments N with the maximum a posteriori
probability, with a binomial a priori distribution on
$\mathrm{pr}(\mathcal{P}_N \mid D)$, for a simulated number of segments k
between 1 and 50. The oblique line represents the right number of segments
(N = k). [Recovered panel titles: Bern(0.3) versus Bern(0.7) with p=0.001;
Bern(0.4) versus Bern(0.6) with p=0.001.]
Figure 5: Boxplots of the numbers of segments N with the maximum a posteriori
probability, with a binomial a priori distribution on
$\mathrm{pr}(\mathcal{P}_N \mid D)$ and an optimized parameter p, for a
simulated number of segments k between 1 and 50. The oblique line represents
the right number of segments (N = k). [Panel titles: Bern(0.3) versus Bern(0.7)
with p = 0.000879; Bern(0.4) versus Bern(0.6) with p = 0.002109.]
Figure 6: Boxplots of the numbers of segments N with the maximum a posteriori
probability, using an a priori distribution $\mathcal{G}(\theta)$ with an
optimized θ, for a simulated number of segments k between 1 and 50. The oblique
line represents the right number of segments (N = k). [Panel titles: Bern(0.3)
versus Bern(0.7) with θ = 0.295; Bern(0.4) versus Bern(0.6) with θ = 0.701.]
Figure 7: Analysis of the CpG islands from a mouse genomic sequence. The
sequence is shown on the x-axis. Top: Partitioning up to 30 segments. A row of
arcs labelled by a number k represents the best k-partition (only even numbers
are shown, for clarity). Each arc represents a segment. On each row, the
relative height of an arc corresponds to the ratio CpGo/e on the segment.
Middle: CpGo/e in 1,000-base sliding windows. Bottom: Predicted CpG islands of
the best 17-partition of the sequence.
Figure 8: Study of a mouse sequence under the set of the k-partitions, given the
CpG island vs non-CpG island models. Left: Log-likelihood of the sequence, for
k numbers of segments, with k between 1 and 50. Right: Boxplots of the numbers
of segments N with the maximum a posteriori probability, with a geometric a
priori distribution $\mathcal{G}(0.546)$, for a simulated number of segments k
between 1 and 50. The simulated sequences were the same length as the studied
one (176973), and the segments were at least 300 long. The oblique line
represents the right number of segments (N = k).